In the first part of this analysis, we use the Airbnb and Zillow datasets to identify which New York City zip codes would generate the most profit on short-term rentals of two-bedroom apartments.
We use this information in the later steps to predict the nightly prices of such apartments in these zip codes using various linear and non-linear models: simple Linear Regression, Random Forest, XGBoost, and Neural Nets
For this analysis, two publicly available datasets from Zillow and Airbnb have been used:
Cost data: Zillow provides an estimate of the value of two-bedroom properties
Revenue data: Airbnb is the medium through which the investor plans to lease out their investment property. This data shows how much properties in various New York City neighbourhoods rent for
The data used for the price prediction part is a subset of the Airbnb data
The following are assumed for the break-even analysis:
Occupancy rate of ~75%
Availability of 365 days
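These assumptions feed directly into the break-even computation carried out later in the notebook. A minimal sketch of the arithmetic (the property cost and nightly price below are hypothetical, purely for illustration):

```python
# Assumptions stated above
availability = 365   # days listed per year
occupancy = 0.75     # ~75% occupancy rate

def breakeven_years(property_cost, nightly_price):
    """Years needed for rental revenue to recoup the purchase cost."""
    annual_revenue = nightly_price * availability * occupancy
    return round(property_cost / annual_revenue, 2)

# Hypothetical example: a $1,000,000 property listed at $200/night
print(breakeven_years(1_000_000, 200))  # 18.26
```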
!pip install --upgrade pip
!pip install plotly
!pip install seaborn
!pip install matplotlib
!pip install folium
!pip install xgboost
!pip install keras
!pip install tensorflow
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import folium
import warnings
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import xgboost as xgb
from keras import models, layers, optimizers, regularizers
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
warnings.filterwarnings('ignore')
folder_path='C:/Users/Vishnu/Desktop/Data Folder'
os.chdir(folder_path)
path=os.getcwd()
costdata = "Zip_Zhvi_2bedroom.csv"
revenuedata = "listings.csv"
nbeds=2
availability=365
occupancy=0.75
city = ["New York"]
def read_data(path, file):
    full_path = os.path.join(path, file)
    return pd.read_csv(full_path)
revenue_data=read_data(path,revenuedata)
cost_data=read_data(path,costdata)
airbnb_data=revenue_data.copy()
zillow_data=cost_data.copy()
Null value check for datasets
Columns such as square_feet, weekly_price, monthly_price, notes, thumbnails, etc. have a large number of missing values, which could affect our analysis if included.
Columns giving the cost data for the years 1996 to 2010 also have a considerable number of values missing
airbnb_data.isnull().sum()
zillow_data.isnull().sum()
This function gives more concrete insight into the percentage of missing values in both datasets. Set the missing-value threshold in the cell below and the function returns the columns whose missing-value percentage exceeds that limit
Set the missing-data % threshold below
missing_percent_threshold=20
def missing_data_list(df):
    n = missing_percent_threshold
    missing_values = round(df.isnull().sum()/len(df) * 100, 2)
    abnormal_cols = missing_values[missing_values > n]
    return abnormal_cols
missing_data_list(zillow_data)
missing_data_list(airbnb_data)
Zillow Dataset: Since there is a considerable number of missing values in the Zillow dataset, only the last year of cost data has been considered for the analysis.
zillow_df=zillow_data.loc[zillow_data['City'].isin(city)]
zillow_df=pd.concat([zillow_df[['RegionName','City', 'State', 'Metro', 'CountyName', 'SizeRank']],zillow_df.iloc[:,-12:]],axis = 1)
airbnb_df=airbnb_data.loc[airbnb_data['bedrooms']==nbeds]
zillow_df.RegionName = zillow_df.RegionName.astype(str)
airbnb_df=airbnb_df.dropna(subset=['zipcode'])
airbnb_df.zipcode=airbnb_df.zipcode.astype(float)
airbnb_df.zipcode=airbnb_df.zipcode.astype(int).astype(str)
def merge_data(cost_data, price_data, costdata_col, pricedata_col):
    merged_data = cost_data.merge(price_data, how='inner', left_on=costdata_col, right_on=pricedata_col)
    merged_data = merged_data.drop_duplicates()
    return merged_data
final_data=merge_data(zillow_df,airbnb_df,'RegionName','zipcode')
final_data.groupby(['RegionName','neighbourhood_group_cleansed']).size()
It is observed that one record with zipcode '10013' is erroneously mapped to Brooklyn in the merged dataset. On examining the neighbourhood and location of this 10013-Brooklyn record, it is found to belong to the 'Bergen Beach' neighbourhood, which is indeed in Brooklyn but carries the zipcode '11234', not '10013'. Remapping the zipcode at this point would produce incorrect apartment prices, since the Zillow prices have already been merged on zipcode. So, given the complexity and time constraint involved in fixing it otherwise, this one erroneous Bergen Beach, Brooklyn record can be removed.
*Several such cases can be found in the original airbnb data
final_data[((final_data.RegionName == '10013') & (final_data.neighbourhood_group_cleansed == 'Brooklyn'))][['neighbourhood','longitude','latitude']]
ind=final_data[((final_data.RegionName == '10013') & (final_data.neighbourhood_group_cleansed == 'Brooklyn'))].index
final_data=final_data.drop(ind)
final_data.groupby(['RegionName','neighbourhood_group_cleansed']).size()
The missing data list obtained using the missing_data_list function helps in deciding whether a column should be retained, dropped, or transformed. For this analysis, only the columns relevant to the analysis have been retained; the missing-value issues can be taken up at a later stage
final= pd.concat([final_data.iloc[:,0:18],final_data[['id','neighbourhood_group_cleansed','city', 'state', 'latitude', 'longitude', 'price', 'weekly_price',
'monthly_price', 'security_deposit', 'cleaning_fee', 'availability_365', 'number_of_reviews','host_id','host_is_superhost']]],axis=1)
The price values contain '$' and ',' characters, which need to be removed before the computation steps that follow.
final[['monthly_price','weekly_price','security_deposit','price','cleaning_fee']].head()
def cleanprices(row):
    row_str = row.astype(str)
    # regex=False so ',' and '$' are treated as literal characters, not regex
    row_replace = row_str.str.replace(',', '', regex=False)
    row_clean = row_replace.str.replace('$', '', regex=False)
    row_clean = row_clean.astype(float)
    return row_clean
final[['monthly_price','weekly_price','security_deposit','price','cleaning_fee']]=final[['monthly_price','weekly_price','security_deposit','price','cleaning_fee']].apply(cleanprices)
Missing values: As mentioned above, there are quite a few columns with a substantial number of missing values, including a few which could otherwise have played an important part in the analysis:
Some zipcodes in the original dataset had a length other than 5, even though such cases are not present in the scenario considered for this analysis
Timeline issue: The Airbnb data was last scraped in the third quarter of FY 2019, whereas the Zillow data provides property cost prices only up to June 2017. We could forecast more recent cost prices using a time series model for a more robust analysis, but this has been proposed as a future step. In the current analysis, the cost prices have been taken as the latest prices, since the time-value-of-money discount rate is assumed to be 0%
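The forecasting step proposed above could start from something like the following. This is only a minimal sketch using a linear-trend extrapolation on synthetic monthly cost values (not the actual Zillow series, and not the seasonal/ARIMA model a full analysis would warrant):

```python
import numpy as np

# Synthetic monthly median costs, for illustration only
costs = np.array([500_000, 505_000, 512_000, 518_000, 523_000, 530_000], dtype=float)

# Fit a linear trend: cost ~ slope * month_index + intercept
months = np.arange(len(costs))
slope, intercept = np.polyfit(months, costs, deg=1)

# Extrapolate one month beyond the observed series
next_cost = slope * len(costs) + intercept
print(round(next_cost))  # 535667
```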
Finally, as part of the data cleaning, columns are renamed for readability
final= final.rename(columns={'RegionName': 'Zipcodes', 'neighbourhood_group_cleansed': 'Neighbourhood'})
print("The Final data dimensions are "+ str(final.shape))
Table showing, for each ZipCode, the neighbourhood, population rank (SizeRank), average price, and number of listings
final_analysis_1 = (final.groupby(['Neighbourhood','Zipcodes','SizeRank'])
                    .agg({'price':'mean','id':'nunique'})
                    .rename(columns={'price': 'Avg_price','id':'No.of listings'})
                    .round({'Avg_price':0})
                    .sort_values(['SizeRank','No.of listings'], ascending=True)
                    .reset_index())
final_analysis_1
final.price.describe().round()
plt.figure(figsize=(20,12))
bplot = sns.boxplot(y='price', x='Zipcodes',
data=final,
width=0.5,
palette="colorblind")
bplot.axes.set_title("Boxplots of Listing Prices by ZipCodes in NY without Outlier Treatment",fontsize=16)
bplot.set_xlabel("Zipcodes", fontsize=14)
bplot.set_ylabel("Price",fontsize=14)
bplot.tick_params(labelsize=12)
Function to cap outliers in prices that are abnormally high or low
def outlier(col):
    percentiles = col.quantile([0.01, 0.99]).values
    col = np.clip(col, percentiles[0], percentiles[1])
    return col
final[['price']]=final[['price']].apply(outlier)
final.price.describe().round()
plt.figure(figsize=(20,12))
bplot = sns.boxplot(y='price', x='Zipcodes',
data=final,
width=0.5,
palette="colorblind")
bplot.axes.set_title("Boxplots of Listing Prices by ZipCodes in NY with Outlier Treatment",fontsize=16)
bplot.set_xlabel("Zipcodes",fontsize=14)
bplot.set_ylabel("Price",fontsize=14)
bplot.tick_params(labelsize=12)
grouped_values=final.groupby(['Neighbourhood','Zipcodes']).agg({'price':'mean'}).sort_values(by='price',ascending=False).reset_index()
plt.figure(figsize=(15,12));
bar=sns.factorplot(x='Zipcodes',y='price',hue='Neighbourhood',data=grouped_values,
size=6, aspect=2,
kind='bar',
dodge=False,legend_out=False);
bar.ax.set_title("ZipCodes by Average Price of the listings in that ZipCode",fontsize=16);
bar.set_xlabels("ZipCodes",fontsize=14).add_legend(title='Neighbourhood');
bar.set_ylabels("Daily Price of the listings",fontsize=14);
cost_trend=pd.concat([final[['Zipcodes','Neighbourhood']],final.iloc[:,6:18]],axis = 1)
cost_trend=cost_trend.melt(id_vars=['Zipcodes','Neighbourhood'],var_name="Year",value_name="Median Cost")
cost_trend=cost_trend.drop_duplicates()
cost_trend.Zipcodes=cost_trend.Zipcodes.astype(float).astype(int).astype(str)
df = cost_trend
fig = px.line(df, x="Year", y="Median Cost", color="Zipcodes", facet_row="Neighbourhood",width=800,height=800)
fig.for_each_annotation(lambda a: a.update(text=a.text.replace("Neighbourhood=", "")))
fig.for_each_trace(lambda t: t.update(name=t.name.replace("Zipcodes=", "")))
fig.update_layout(showlegend=True)
fig.update_xaxes(ticks="inside",nticks=20,showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(ticks="inside", col=1,showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_layout(legend=dict(x=0, y=-0.2))
fig.update_layout(legend_orientation="h")
fig.update_traces(hoverinfo='text+name', mode='lines+markers')
fig.update_layout(
title={
'text': "Cost Trend of the Properties by ZipCode for last 12 Months",
'y':0.96,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="Year",
yaxis_title="Median cost",
font=dict(family="Courier New, monospace",size=10,color="#7f7f7f"))
fig.show()
avg_cost_latest = zillow_df.copy()
avg_cost_latest = avg_cost_latest.loc[:, ['RegionName']]
avg_cost_latest['avg_cost'] = zillow_df[zillow_df.columns[-6:]].mean(axis =1)
final_data_cost=merge_data(avg_cost_latest,final,'RegionName','Zipcodes')
grouped_values_cost=final_data_cost.groupby(['Neighbourhood','Zipcodes']).agg({'avg_cost':'mean'}).sort_values(by='avg_cost',ascending=False).reset_index()
plt.figure(figsize=(15,12))
bar=sns.factorplot(x='Zipcodes',y='avg_cost',hue='Neighbourhood',data=grouped_values_cost,
size=6, aspect=2,
kind='bar',
dodge=False,legend_out=False)
bar.ax.set_title("ZipCodes by Average Cost of the listings in that ZipCode",fontsize=16)
bar.set_xlabels("ZipCodes", fontsize=14).add_legend(title='Neighbourhood')
bar.set_ylabels("Average Cost of the listings",fontsize=14)
The zipcodes have been split across four quadrants to identify the profitable ones
High cost, high return zipcodes: These zipcodes are profitable but do not make very high profits, due to the high cost involved at the time of purchase. Most of the Manhattan properties fall into this category. Example ZipCodes:
Low cost, high return zipcodes: These are the most highly recommended zipcodes, as they make high profits with a smaller investment. However, not many zipcodes fall into this category. Example zipcodes:
Low cost, low return zipcodes: These zipcodes are recommended after the ones in the second quadrant; they involve lower cost and correspondingly lower price/revenue. Example zipcodes:
High cost, low return zipcodes: No zipcodes belong to this quadrant
final_data_cost['Normalized_price'] = (final_data_cost.price-final_data_cost.price.mean())/final_data_cost.price.std()
final_data_cost['Normalized_cost'] = (final_data_cost.avg_cost-final_data_cost.avg_cost.mean())/final_data_cost.avg_cost.std()
df = final_data_cost
grouped_values_bubble=final_data_cost.groupby(['Neighbourhood','Zipcodes']).agg({'Normalized_cost':'mean','Normalized_price':'mean'}).reset_index()
fig = px.scatter(grouped_values_bubble, x="Normalized_cost", y="Normalized_price", color="Neighbourhood",
hover_data=['Zipcodes'],size_max=60,width=800, height=600)
fig.for_each_trace(lambda t: t.update(name=t.name.replace("Neighbourhood=", "")))
fig.update_traces(mode="markers",marker=dict(size=12))
fig.update_layout(
title={
'text': "Cost Benefit Analysis of Zipcodes",
'y':0.96,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="Normalized Cost",
yaxis_title="Normalized Price",
font=dict(
family="Courier New, monospace",
size=12,
color="#7f7f7f"))
fig.update_layout(autosize=False,width=800,height=600)
fig.add_shape(
go.layout.Shape(
type="line",
x0=0,
y0=2,
x1=0,
y1=-2,
line=dict(
color="grey",
width=3,dash="dashdot"
)
))
fig.add_shape(
go.layout.Shape(
type="line",
x0=-3,
y0=-0,
x1=3,
y1=0,
line=dict(
color="grey",
width=4,
dash="dashdot",
),
))
fig.show()
The break-even point is the point at which total cost and total revenue are equal: there is no net loss or gain. At the break-even point, revenues exactly cover all of the expenses on an income statement prepared under the accrual method of accounting.
The break-even point for this business case can be determined as given below
breakeven_analysis = final_data_cost.groupby(['Neighbourhood','Zipcodes']).agg({'price':'mean','avg_cost':'mean'}).rename(columns={'price': 'Avg_price','avg_cost':'Mean_cost'}).round({'Avg_price':2,'Mean_cost':2}).reset_index()
breakeven_analysis['Breakeven_years']= round(breakeven_analysis.Mean_cost/(breakeven_analysis.Avg_price*availability*occupancy),2)
The graph below shows the break-even points (in years), sorted from lowest to highest and coloured by neighbourhood.
breakeven_analysis["Zip-Neighbourhood"] = breakeven_analysis["Zipcodes"].map(str)+ '-'+breakeven_analysis["Neighbourhood"].str[:1]
fig = px.bar(breakeven_analysis, x="Zip-Neighbourhood", y="Breakeven_years",
hover_data=['Neighbourhood'], color='Neighbourhood',
height=500,width=1000)
fig.for_each_trace(lambda t: t.update(name=t.name.replace("Neighbourhood=", "")))
fig.update_xaxes(categoryorder="total ascending")
fig.update_layout(
title={
'text': "BreakEven point analysis by Zipcode",
'y':0.96,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="Zip-Neighbourhood",
yaxis_title="BreakEven Time in Years",
font=dict(
family="Courier New, monospace",
size=12,
color="#7f7f7f"))
fig.update_layout(legend_orientation="v")
fig.show()
ZIP codes shown as bar graphs aren't a very intuitive way to present results by themselves. Presenting the same results on a map helps the real estate company understand them visually.
The size of a bubble decreases with the break-even value in years: the larger the bubble, the shorter the time to break even, and vice versa. A bubble can be clicked to see all the information associated with that point.
map_data = final_data_cost.groupby(['Neighbourhood','Zipcodes']).agg({'price':'mean','avg_cost':'mean','latitude':'mean','longitude':'mean'}).rename(columns={'price': 'Avg_price','avg_cost':'Mean_cost','latitude':'Latitude','longitude':'Longitude'}).round({'Avg_price':2,'Mean_cost':2}).reset_index()
map_data['Breakeven_years']= round(map_data.Mean_cost/(map_data.Avg_price*availability*occupancy),2)
folium_map_Breakeven = folium.Map(location=[40.67, -74],
zoom_start=11,
tiles="CartoDB dark_matter")
for index, row in map_data.iterrows():
    popup_text = "Zipcode: {}<br> Neighbourhood: {}<br> Breakeven years: {}"
    popup_text = popup_text.format(row["Zipcodes"], row["Neighbourhood"], row["Breakeven_years"])
    if row["Neighbourhood"] == 'Manhattan':
        color = "#FFCE00"  # yellow
    elif row["Neighbourhood"] == 'Queens':
        color = "#E37222"  # orange
    elif row["Neighbourhood"] == 'Staten Island':
        color = "#0375B4"  # blue
    else:
        color = "#0A8A9F"  # teal
    radius = row['Breakeven_years']
    folium.CircleMarker(location=(row["Latitude"], row["Longitude"]), radius=-radius+35, color=color, popup=popup_text,
                        fill=True).add_to(folium_map_Breakeven)
folium_map_Breakeven
Venturing into a market without understanding the competition is risky. Identifying our competitors is important before we finalize our business strategies. It is vital to the success of a new or existing business because it reduces risk, time, required resources, and expense.
As per Airbnb, a superhost is someone who sets a shining example for other hosts and provides extraordinary experiences for their guests. So it is important to look at where our competition is before we start investing.
Rather than looking at the number of superhosts, an analysis of their proportion out of the total listings would give us a better idea of the competition.
final_org=final.groupby(['Zipcodes','Neighbourhood']).agg({'latitude':'mean','longitude':'mean','host_is_superhost':'count'}).rename(columns={'latitude':'Latitude','longitude':'Longitude'}).reset_index()
final_sh=final[final.host_is_superhost=='t']
final_sh=final_sh.groupby(['Zipcodes','Neighbourhood'])['host_is_superhost'].count().reset_index()
final_competitors = pd.merge(final_org,final_sh[['Zipcodes','host_is_superhost']],on='Zipcodes', how='inner')
final_competitors=final_competitors.rename(columns={'host_is_superhost_x':'Total_listings','host_is_superhost_y':'Total_super_hosts'})
final_competitors['Percentage_of_super_hosts']= round(final_competitors.Total_super_hosts/final_competitors.Total_listings * 100,2)
folium_map_competitors = folium.Map(location=[40.67, -74],
zoom_start=11,
tiles="CartoDB dark_matter")
for index, row in final_competitors.iterrows():
    popup_text = "Zipcode: {}<br> Neighbourhood: {}<br> Listings: {}<br> Superhosts: {}<br> Superhosts %: {}"
    popup_text = popup_text.format(row["Zipcodes"], row["Neighbourhood"], row["Total_listings"], row["Total_super_hosts"], row["Percentage_of_super_hosts"])
    if row["Neighbourhood"] == 'Manhattan':
        color = "#FFCE00"  # yellow
    elif row["Neighbourhood"] == 'Queens':
        color = "#E37222"  # orange
    elif row["Neighbourhood"] == 'Staten Island':
        color = "#0375B4"  # blue
    else:
        color = "#0A8A9F"  # teal
    radius = row['Percentage_of_super_hosts']
    folium.CircleMarker(location=(row["Latitude"], row["Longitude"]), radius=radius/2, color=color, popup=popup_text,
                        fill=True).add_to(folium_map_competitors)
folium_map_competitors
A thorough study of the data from the different sources was performed followed by data munging and cleaning steps
A basic exploratory data analysis was performed on the cleaned dataset to derive insights
Created different plots to understand the price and cost distribution
Performed a Price vs Cost Quadrant Analysis to understand the high-risk and high-return ZipCodes
Performed Breakeven Analysis to understand the ZipCodes which recoup the investment the earliest
Performed a Competitive Analysis to understand the competition and identify highly competitive zipcodes before investing
#We will use the final_data for the analysis
final_data.head()
final_data.columns
# subsetting the data to include only useful columns
model_data=final_data[['price','security_deposit','cleaning_fee','extra_people','host_response_rate','host_is_superhost','host_listings_count','host_has_profile_pic','host_identity_verified','neighbourhood_group_cleansed','city','state','property_type','room_type','accommodates','bathrooms','bedrooms','beds','guests_included','number_of_reviews','review_scores_value','instant_bookable','is_business_travel_ready','reviews_per_month']]
# investigating the column types
model_data.dtypes
It is observed that the price columns need to be converted as before. A few other columns also need some transformation to make them fit for the analysis
missing_data_list(model_data)
model_data.host_response_rate = model_data.host_response_rate.str[:-1].astype('float64')
#Analyzing the host response rate column to bin and understand how the data is spread
print("Median host response rate:", model_data['host_response_rate'].median())
print(f"Proportion of 100% host response rates: {round(((model_data.host_response_rate == 100.0).sum()/model_data.host_response_rate.count())*100,1)}%")
# Bin into four categories
model_data.host_response_rate = pd.cut(model_data.host_response_rate, bins=[0, 50, 90, 99, 100], labels=['0-49%', '50-89%', '90-99%', '100%'], include_lowest=True)
# Converting to string to make it categories
model_data.host_response_rate = model_data.host_response_rate.astype('str')
model_data.host_response_rate.replace('nan', 'unknown', inplace=True)
#Check nulls in host related columns
len(model_data[model_data.loc[:, ['host_is_superhost', 'host_listings_count', 'host_has_profile_pic', 'host_identity_verified']].isnull().sum(axis=1) == 4])
#The property type can be grouped to reduce the no of levels
model_data.property_type.value_counts()
# Replacing categories that are types of houses or apartments
model_data.property_type.replace({
    'Townhouse': 'House',
    'Serviced apartment': 'Apartment',
    'Loft': 'Apartment',
    'Bungalow': 'House',
    'Cottage': 'House',
    'Villa': 'House',
    'Tiny house': 'House',
    'Earth house': 'House',
    'Chalet': 'House'
}, inplace=True)
model_data.loc[~model_data.property_type.isin(['House', 'Apartment']), 'property_type'] = 'Other'
#imputing with median for null values in columns such as bathrooms, bedrooms and beds
for col in ['bathrooms', 'bedrooms', 'beds']:
    model_data[col].fillna(model_data[col].median(), inplace=True)
#cleaning price columns
model_data[['security_deposit','price','cleaning_fee','extra_people']]=model_data[['security_deposit','price','cleaning_fee','extra_people']].apply(cleanprices)
model_data[['security_deposit']].dtypes
#cleaning up columns - imputing 0
model_data.security_deposit.fillna(0, inplace=True)
model_data.cleaning_fee.fillna(0, inplace=True)
model_data.extra_people.fillna(0, inplace=True)
model_data.isna().sum()
model_data.dropna(subset = ["host_is_superhost"], inplace=True)
model_data.isna().sum()
#Function to bin columns
def binning(col, bins, labels, na_label='unknown'):
    model_data[col] = pd.cut(model_data[col], bins=bins, labels=labels, include_lowest=True)
    model_data[col] = model_data[col].astype('str')
    # after astype('str'), NaNs become the string 'nan', so fillna would miss them
    model_data[col].replace('nan', na_label, inplace=True)
binning('review_scores_value',
bins=[0, 8, 9, 10],
labels=['0-8/10', '9/10', '10/10'],
na_label='no reviews')
model_data.drop(['reviews_per_month','city','state'], axis=1, inplace=True)
model_data.isna().sum()
to_drop = ['bedrooms','neighbourhood_group_cleansed']
model_data.drop(to_drop, axis=1, inplace=True)
transformed_data = pd.get_dummies(model_data,drop_first=True)
transformed_data.head()
def mc_heatmap(df, figsize=(20,20)):
    sns.set(style="whitegrid")
    corr = df.corr()  # correlation matrix
    # Generate a mask the size of the correlation matrix (upper triangle)
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=figsize)
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, vmax=corr[corr != 1.0].max().max())
mc_heatmap(transformed_data)
#log transforming the numerical columns as they were found to be skewed
numerical_columns = ['accommodates', 'bathrooms', 'cleaning_fee', 'extra_people', 'host_listings_count', 'number_of_reviews', 'price', 'security_deposit','beds','guests_included']
for col in numerical_columns:
    transformed_data[col] = transformed_data[col].astype('float64').replace(0.0, 0.01)  # replace 0s with 0.01 so the log is defined
    transformed_data[col] = np.log(transformed_data[col])
#Defining predictors and response variables
X = transformed_data.drop('price', axis=1)
y = transformed_data.price
# Scaling the predictors using StandardScaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=list(X.columns))
#Checking for multicollinearity using Variance Inflation Factor (values>10 should be checked for mc)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif.round(1)
# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)
Lmodel=LinearRegression(fit_intercept=True).fit(X_train,y_train)
Lmodel.coef_,Lmodel.intercept_
predicted_values=Lmodel.predict(X_test)
mean_squared_error(y_test,predicted_values)
training_regr = Lmodel.predict(X_train)
test_regr = Lmodel.predict(X_test)
print("\nTraining MSE:", round(mean_squared_error(y_train, training_regr),4))
print("Validation MSE:", round(mean_squared_error(y_test, test_regr),4))
RF = RandomForestRegressor(n_estimators = 60)
RF.fit(X_train,y_train)
training_rfreg = RF.predict(X_train)
val_preds_rfreg = RF.predict(X_test)
print("\nTraining MSE:", round(mean_squared_error(y_train, training_rfreg),4))
print("Validation MSE:", round(mean_squared_error(y_test, val_preds_rfreg),4))
xgb_regr = xgb.XGBRegressor()
xgb_regr.fit(X_train, y_train)
xgb_reg_train = xgb_regr.predict(X_train) #on train set
# Validate
xgb_reg_test = xgb_regr.predict(X_test) # on test set
print("\nTraining MSE:", round(mean_squared_error(y_train, xgb_reg_train),4))
print("Validation MSE:", round(mean_squared_error(y_test, xgb_reg_test),4))
#2 hidden layers
nnet_2 = models.Sequential()
nnet_2.add(layers.Dense(128, input_shape=(X_train.shape[1],), activation='relu'))
nnet_2.add(layers.Dense(256, activation='relu'))
nnet_2.add(layers.Dense(256, activation='relu'))
nnet_2.add(layers.Dense(1, activation='linear'))  # linear activation since it's regression
# Compiling the model and summary
nnet_2.compile(loss='mean_squared_error',optimizer='adam',metrics=['mean_squared_error'])
print(nnet_2.summary())
# Model Training and validating on 10%
nnet2_train = nnet_2.fit(X_train,
y_train,
epochs=150,
batch_size=256,
validation_split = 0.1)
# MSE values
nnet2_test_pred = nnet_2.predict(X_test)
nnet2_train_pred = nnet_2.predict(X_train)
print("Training MSE:", round(mean_squared_error(y_train, nnet2_train_pred),4))
print("Validation MSE:", round(mean_squared_error(y_test, nnet2_test_pred),4))
Neural Net with 3 hidden layers and other configurations
# neural net with 3 hidden layers
nnet_3 = models.Sequential()
nnet_3.add(layers.Dense(128, input_shape=(X_train.shape[1],), kernel_regularizer=regularizers.l1(0.005), activation='relu'))
nnet_3.add(layers.Dense(256, kernel_regularizer=regularizers.l1(0.005), activation='relu'))
nnet_3.add(layers.Dense(256, kernel_regularizer=regularizers.l1(0.005), activation='relu'))
nnet_3.add(layers.Dense(512, kernel_regularizer=regularizers.l1(0.005), activation='relu'))
nnet_3.add(layers.Dense(1, activation='linear'))
# Model compilation
nnet_3.compile(loss='mean_squared_error',
optimizer='adam',
metrics=['mean_squared_error'])
print(nnet_3.summary())
# Training the model
nnet_3_train = nnet_3.fit(X_train,
y_train,
epochs=150,
batch_size=256,
validation_split = 0.1)
# MSE and r squared values
nnet_3_test_pred = nnet_3.predict(X_test)
nnet_3_train_pred = nnet_3.predict(X_train)
print("Training MSE:", round(mean_squared_error(y_train, nnet_3_train_pred),4))
print("Validation MSE:", round(mean_squared_error(y_test, nnet_3_test_pred),4))
From a low-risk, high-return-on-investment standpoint, the zipcodes 10025 and 10036 in Manhattan are good investment options
From a low-risk, moderate-return standpoint, I would recommend the zipcode 11434 in Queens and the zipcodes 10306 and 10305 in Staten Island as potentially good avenues of investment
From a break-even perspective, I would recommend zipcode 11434 in Queens as the best zipcode to invest in, followed by 10306 and 10303 in Staten Island
From a competitive analysis perspective, it is recommended that the client exercise caution while investing in Queens or Staten Island, as these neighbourhoods have a significant proportion of listings that provide high-quality stays to their customers
It is observed that, of the models tested, the neural net with three hidden layers is by far the best: it has a lower test MSE than simple Linear Regression, Random Forest, XGBoost, and the neural net configuration with two hidden layers